REMARKS

Applicants respectfully request reconsideration in view of the above amendments and the
following remarks. Applicants add claims 22-26 and request their entry. Applicants do not amend or
cancel any claims. Accordingly, claims 1-26 are pending in the application.



I. Amendments to the Claims

Applicants add claims 22-26, which depend from claims 1, 6, 12, 15 and 19, respectively. 
Support for the new dependent claims may be found in the specification at ¶¶ [0025] and [0029].
The amendments to the claims are therefore supported by the specification and do not add new 
matter. For at least the foregoing reasons, Applicants respectfully request consideration and 
entry of the attached amendments to the claims. 

II. Claims Rejected Under 35 U.S.C. § 102

Claims 6-21 stand rejected under 35 U.S.C. § 102(e) as being anticipated by U.S.
Patent No. 6,807,621 issued to Strombergson et al. (hereinafter "Strombergson").

To anticipate a claim, a single reference must disclose each element of that claim. In
regard to claim 6, this claim includes the elements of "tracking program order of the first set of
instructions relative to the second set of instructions in a global reorder buffer." Applicants
believe that Strombergson does not teach these elements of claim 6. Examiner cites commit
stage 5 in Fig. 1 of Strombergson and asserts that "since the commit stage contains a reorder
buffer (ROB 10), the stage is responsible ... for tracking program order from all the execution
stages." Examiner further asserts that "commit stage 5 . . . must obtain each of the instructions
previously contained in the local buffers. The contents of these local buffers must be merged
precisely back into the program order for the processor to work appropriately." Examiner
concludes that therefore, "[t]he relative order must be tracked; otherwise, these instructions could
not possibly be merged back into the original program order."

Applicants assume that Examiner is asserting that the elements of claim 6 are inherently
taught by Strombergson, since Examiner has not relied on Strombergson for explicit disclosure
of "tracking program order of the first set of instructions relative to the second set of instructions
in a global reorder buffer." As noted previously on page 11 of the Amendment and Response to
Office Action filed by Applicants on February 24, 2006, Examiner's assertions rely on
information extrinsic to the cited reference and are not proper bases for an anticipation rejection.

Further, it is not inherent in Strombergson that the relative order is tracked by commit
stage 5. For example, assume that the decoding stage 2 (Strombergson, Fig. 1) places an entry
for the instruction in reorder buffer 10 of commit stage 5. See Patterson & Hennessy, Computer
Organization & Design 517 (2d ed. 1998) (submitted herewith). In such a case, although commit
stage 5 is keeping track of th[...]
instructions already in program order, from decoding stage 2 to reorder buffer 10. Thus, in this
example, the commit stage is keeping track of the overall program order (or absolute order) via
the reorder buffer, but the commit stage does not track program order of a first set of
instructions relative to a second set of instructions in order to properly maintain program order of the
instructions.

Further, a standard reorder buffer does not track the relative order of instructions that
have been assigned to different execution units. Rather, a standard reorder buffer is a queue
that holds the entire sequence of instructions being operated on by all execution
units. As the execution units complete their execution of instructions, the results are placed in the
appropriate slot of the queue in the standard reorder buffer. See High Performance Micro
Processors: Chapter 8 at wwwxs.swan^c.ak/^csYie al/hpm/reorder.htrnl (last visited Feb. 24,
2006) (and submitted herewith). The Examiner has not indicated, and the Applicants have been
unable to discern, any part of Strombergson that teaches that the reorder buffer in the commit
stage is anything other than a standard reorder buffer.

Therefore, Strombergson does not teach each of the elements of claim 6. Accordingly,
reconsideration and withdrawal of the anticipation rejection of claim 6 are requested.

Claims 7-11 depend from independent claim 6 and incorporate the limitations thereof.
Thus, at least for the reasons mentioned above in regard to independent claim 6, these claims are
not anticipated by Strombergson. Accordingly, reconsideration and withdrawal of the
anticipation rejection of these claims are requested. 

In regard to independent claims 12, 15 and 19, these claims include elements similar to
those of independent claim 6 including "a global reorder buffer to track instruction order of
instructions assigned to the first reorder buffer relative to the second reorder buffer," "tracking
program order of the first set of instructions relative to the second set of instructions in a global
tracking device" and "a means for directing program order of the first set of instructions relative
to the second of instructions and the global tracking device." Thus, at least for the reasons
mentioned above in regard to independent claim 6, Strombergson does not anticipate each of
these claims. Accordingly, reconsideration and withdrawal of the anticipation rejection of these
claims are requested.

Claims 13, 14, 16-18, 20 and 21 depend from independent claims 12, 15 and 19, 
respectively, and incorporate the limitations thereof. Thus, at least for the reasons mentioned 
above in regard to independent claims 12, 15 and 19, these claims are not anticipated by 
Strombergson. Accordingly, reconsideration and withdrawal of the anticipation rejection of
these claims are requested.

Claims 1-5 stand rejected under 35 U.S.C. § 102(e) as being anticipated by
Strombergson.

In regard to claim 1, this claim includes elements similar to those of independent claim 6
including "a third device coupled to the first device and second device to track relative segment
order between the first device and the second device." Thus, at least for the reasons mentioned
above in regard to independent claim 6, Strombergson does not anticipate claim 1. Accordingly,
reconsideration and withdrawal of the anticipation rejection of this claim are requested.

Claims 2-5 depend from independent claim 1 and incorporate the limitations thereof.
Thus, at least for the reasons mentioned above in regard to independent claim 1, these claims are
not anticipated by Strombergson. Accordingly, reconsideration and withdrawal of the
anticipation rejection of these claims are requested.
CONCLUSION 

In view of the foregoing, it is believed that all claims now pending, namely claims 1-26,
patentably define the subject invention over the prior art of record and are in condition for
allowance. Such action is earnestly solicited at the earliest possible date. If Examiner believes
that a telephone conference would be useful in moving the application forward to allowance,
Examiner is encouraged to contact the undersigned at (310) 207-3800.



Dated: October 31, 2006



Respectfully submitted, 

BLAKELY, SOKOLOFF, TAYLOR & ZAFMAN LLP 




Reg. No. 48,534



12400 Wilshire Boulevard, Seventh Floor
Los Angeles, California 90025
(310) 207-3800



I hereby certify that t[...]
transmitted via facsimile to 571-273-8300 addressed to:
Mail Stop AF, Commissioner for Patents, P.O. Box 1450,
Alexandria, VA 22313-1450.



Annie McNally 



Date 

Chapter 8: Reorder Buffers and Register Renaming



We have seen a range of possible ways that execution in a pipelined/superscalar machine could be
delayed, but we have looked at actual mechanisms for reducing the effects of only some of them. For
example, we have looked at various strategies for procedural dependency (e.g. branch prediction), and have
mentioned (if briefly) a strategy for resource conflict (duplication of resources). We have done nothing
for those conflicts that involve data - i.e. true data dependency, output dependency and antidependency.
We have not even made it clear, except informally, how we would go about recognizing their presence.
In this chapter and the next we will start to address this. In particular, in this chapter, we will look at part
of the solution to eliminating WAW and WAR hazards arising from output dependency and antidependency, and
mechanisms to enforce a precise architectural state. It turns out that the same hardware is involved in
both of these steps.

As we saw earlier, true data dependency is a property of a program, but name dependencies are not and
can potentially be eliminated. Usually name dependencies involve conflicts in register use. One possible
approach is to provide as many registers as possible, in an attempt to avoid clashes by giving the
compiler plenty of choice. (This is basically the same solution used for resource conflicts - provide more
resources.)

One problem is backward compatibility with existing architectures - these often have a limited number
of registers, and there is no way to increase this number without radically redefining the architecture.
Another problem is that a large register file means more work saving/restoring it when changing
between processes (context switching).

8.1. Register Renaming

A devious strategy is register renaming - the hardware has a larg(ish) set of registers - often several
times as many as the actual architecture claims to have. These registers are not associated permanently
with the registers of the architecture, but are dynamically allocated as needed. Furthermore, there can be
several versions of an architectural register present at any one time. Consider the following code:

MUL R2,R2,R3      R2 = R2 * R3
ADD R4,R2,1       R4 = R2 + 1
ADD R2,R3,1       R2 = R3 + 1
DIV R5,R2,R4      R5 = R2 / R4

Consider one problem: instruction 3 cannot go ahead until instruction 1 has finished, and instruction 2
has started. This is because there is an output dependency between instructions 1 and 3 - both write to
R2 - and an antidependency between instructions 2 and 3 - instruction 3 overwrites instruction 2's
argument. Now consider the same program, but with the registers labelled.

MUL R2_b,R2_a,R3_a
ADD R4_b,R2_b,1
ADD R2_c,R3_a,1
DIV R5_b,R2_c,R4_b



Now instruction 3 can start immediately, because it is using a 'different' R2 from instructions 1 and 2.
We are effectively using a history of the contents of each register - for example, R2_c is the newest
version of R2, then R2_b, followed by R2_a (the oldest version).
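
The labelling above can be reproduced mechanically. The following is a minimal Python sketch, not part of the original chapter: the function name `rename`, the tuple layout `(op, dest, src1, src2)`, and the convention that anything not starting with "R" is an immediate are all assumptions made for illustration. Each write to a register creates a new version (suffix _a, _b, _c, ...), and each read uses the newest version that exists at that point in program order.

```python
from collections import defaultdict


def rename(instructions):
    """Label every register operand with a version suffix (_a, _b, ...)."""
    version = defaultdict(int)  # architectural register -> index of newest version

    def current(operand):
        # Immediates (anything not starting with 'R') pass through unchanged.
        if not operand.startswith("R"):
            return operand
        return f"{operand}_{chr(ord('a') + version[operand])}"

    renamed = []
    for op, dest, src1, src2 in instructions:
        # Sources must be read before the destination version is bumped,
        # otherwise "MUL R2,R2,R3" would read its own (new) result.
        s1, s2 = current(src1), current(src2)
        version[dest] += 1  # every write creates a fresh version
        renamed.append((op, current(dest), s1, s2))
    return renamed


program = [
    ("MUL", "R2", "R2", "R3"),
    ("ADD", "R4", "R2", "1"),
    ("ADD", "R2", "R3", "1"),
    ("DIV", "R5", "R2", "R4"),
]

for op, dest, s1, s2 in rename(program):
    print(f"{op} {dest},{s1},{s2}")
```

Because instruction 3 writes R2_c while instruction 2 reads R2_b, the two no longer name the same storage, which is why the output dependency and antidependency disappear.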

There are two ways we can go about implementing register renaming. The first is by explicitly providing
a larger set of registers than the architecture claims is present - a technique usually simply called register
renaming. The second (and more usual) is by using a reorder buffer. We will look at reorder buffers
first, and then briefly consider the alternative.



8.2. Reorder Buffers 



Renaming based on a reorder buffer uses a physical register file that is the same size as the architectural 
register file, together with a set of registers arranged as a queue data structure, called the reorder buffer. 

As instructions are issued, they are assigned entries for any results they may generate at the tail of the
reorder buffer. That is, a place is reserved in the queue. We maintain the logical order of instructions
within this buffer - so if we can issue four instructions i to i+3 at once, we put i in the reorder buffer
first, followed by i+1, i+2 and i+3. As instruction execution proceeds, the assigned entry will
ultimately be filled in by a value, representing the result of the instruction. When entries reach the head
of the reorder buffer, provided they've been filled in with their actual intended result, they are removed,
and each value is written to its intended architectural register. If the value is not yet available, then we
must wait until it is. Because instructions take variable times to execute, and because they may be
executed out of program order, we may well find that the reorder buffer entry at the head of the queue is
[...]
must stay in the reorder buffer until the head instruction completes. For example, consider the case of 6
instructions I1-I6. Suppose at a given clock cycle that I1 and I2 both finish, and that at earlier clock
cycles I4 to I6 also finished, but I3 is yet to complete. We can move the results for I1 and I2 out of the
reorder buffer into their respective architectural registers. However, I4 to I6 must wait until I3 has
completed. Fig. 1 illustrates the basic idea.



[Figure: reorder buffer entries, listed from the tail (I7) to the head (I1)]

Entry   Result
I7      ?
I6      OK
I5      OK
I4      OK
I3      ?
I2      OK
I1      OK

Figure labels: "Waiting for instruction results"; "These results available for
forwarding/bypassing to other instructions"; "These results cannot be moved to their
architectural destinations"; "These results can be moved to their architectural destinations".


Fig. 1. A Reorder Buffer

As described, the reorder buffer does not solve our problems - though it does solve another one (see
below). However, even though the results of some instructions (I5 and I6 above) cannot be moved to
their architectural destinations, they can still be used in computations. Suppose, in the example above,
that we execute a further instruction I7, which uses some register R2 as an operand. Suppose further that
the result of I5 will eventually be stored in R2, and this is the actual value required by I7. Even though
the value in the reorder buffer computed by I5 has not yet been moved to R2, we can still use it in the
computation of I7. This process is called forwarding or bypassing, and we have mentioned it already
when we considered basic pipelining - though not seen how to implement it. The reorder buffer
effectively provides the history mechanism required for register renaming. The oldest version of a
register is that stored in the architectural register, the next oldest is that nearest the head of the reorder
buffer; the youngest is that nearest the tail of the reorder buffer.
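
The queue behaviour and the forwarding lookup described above can be sketched in a few lines of Python. This is an illustrative model only, not a description of any particular processor: the class `ReorderBuffer`, its method names, the destination registers chosen for I1 to I6 (other than R2 for I5, which the text states), and the string result values are all assumptions.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # head (oldest) at the left, tail at the right
        self.arch_regs = {}     # architectural register file

    def issue(self, name, dest):
        # Reserve a slot at the tail, in program order.
        entry = {"name": name, "dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        # An execution unit fills in the reserved slot, in any order.
        entry["value"], entry["done"] = value, True

    def retire(self):
        # Move results out of the head only while the head is complete.
        retired = []
        while self.entries and self.entries[0]["done"]:
            e = self.entries.popleft()
            self.arch_regs[e["dest"]] = e["value"]
            retired.append(e["name"])
        return retired

    def read(self, reg):
        # Forwarding: the youngest version is the one nearest the tail.
        for e in reversed(self.entries):
            if e["dest"] == reg:
                return e["value"] if e["done"] else None  # None: must wait
        return self.arch_regs.get(reg)  # oldest version: architectural register


rob = ReorderBuffer()
dests = {"I1": "R1", "I2": "R3", "I3": "R4", "I4": "R5", "I5": "R2", "I6": "R6"}
slots = {name: rob.issue(name, dest) for name, dest in dests.items()}

for name in ("I4", "I5", "I6", "I1", "I2"):  # I3 has not finished yet
    rob.complete(slots[name], f"result of {name}")

first_retired = rob.retire()   # only I1 and I2 can leave the buffer
forwarded = rob.read("R2")     # I7 can already use the result of I5
rob.complete(slots["I3"], "result of I3")
later_retired = rob.retire()   # I3 unblocks I4 to I6
```

A deque is used here simply because entries enter at one end and leave at the other; real hardware would use a circular buffer with head and tail pointers.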

(In practice, of course, the details of implementation are not that straightforward. Reorder buffer entries
need to store considerable amounts of information about instruction results - the instruction, its eventual
destination, whether the result is valid or not - and it is important to be able to access all this information
quickly. In order to avoid stalling the pipeline, we must be able to quickly identify when a result -
needed by another instruction - becomes available, and also fetch it quickly.)

8.2.1. Maintaining a Precise Architectural State

The other problem that reorder buffers solve is that of maintaining a precise architectural state. Recall
from an earlier chapter the problem of an instruction i+1 terminating before instruction i, and then i
causing an error. Reorder buffers solve this by effectively keeping instruction results provisional until
earlier ones are known - provided instructions are actually issued in program order. Suppose in the
example above that instruction I3 caused an error. We simply discard the contents of the reorder buffer
(including the already-exec[...]
waste some work (redoing I4 to I6), but we maintain backward compatibility, and a precise architectural
state (because only the results of I1 and I2 will have been written out to architectural registers). Note
that this is only actually the case if we issue the instructions in program order, and hence the order of the
entries in the reorder buffer reflects program order. However, recall that we generally do issue
instructions in order.
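
The discard-on-error behaviour can be illustrated with a short Python sketch. It is a simplified model under stated assumptions: the function name `commit_in_order`, the helper `entry`, the `fault` flag, and the register and value choices are hypothetical and are used only to show why the architectural state stays precise.

```python
def commit_in_order(rob, arch_regs):
    """Commit completed entries from the head; squash everything on a fault.

    `rob` is a list of entries in program order, head first.
    Returns (names committed, names discarded).
    """
    committed = []
    while rob and rob[0]["done"]:
        head = rob[0]
        if head["fault"]:
            # Nothing at or after the faulting instruction has touched the
            # architectural registers, so the buffer can simply be emptied.
            squashed = [e["name"] for e in rob]
            rob.clear()
            return committed, squashed
        arch_regs[head["dest"]] = head["value"]
        committed.append(head["name"])
        rob.pop(0)
    return committed, []


def entry(name, dest, value, done=True, fault=False):
    return {"name": name, "dest": dest, "value": value, "done": done, "fault": fault}


rob = [
    entry("I1", "R1", 11),
    entry("I2", "R2", 22),
    entry("I3", "R3", None, fault=True),  # I3 causes an error
    entry("I4", "R4", 44),                # already executed, but provisional
    entry("I5", "R5", 55),
    entry("I6", "R6", 66),
]
arch_regs = {}
committed, squashed = commit_in_order(rob, arch_regs)
```

After the call, only the results of I1 and I2 are architecturally visible, and the work done for I4 to I6 is lost and would have to be redone.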

A final point to note is that reorder buffers do not cause any additional workload when we do a context
switch, since the contents of the reorder buffer do not have to be saved. We do of course have to wait for
any outstanding instructions to leave the buffer (or just dump the results and repeat the dumped
instructions), but this is much quicker than the (slow) process of saving registers to memory.

8.3. Another Register Renaming Strategy 

Reorder buffers are convenient and simple (at least conceptually). They are also widely used (for
example, all P6-based processors (Pentium Pro, Pentium II and all Pentium III-series processors) use
reorder buffers). However, they are not completely without disadvantages. For example, they add an
extra step to the pipeline - moving results from reorder buffer to architectural registers. What is more,
there is generally a limit on the number of entries that can be moved simultaneously. For example,
suppose there are six execution units. In principle, this means six instructions can complete
simultaneously. In practice, this will not happen often - it is unlikely we will feel it is worthwhile
designing a reorder buffer and register set that will allow six results to be moved at once. So when it
does happen, or whenever more instructions complete than we can move simultaneously, the pipeline
will stall. (Actually it may not all stall - only the execution unit(s) that wait.) Also, a reorder buffer is an
additional place that execution units (or issue units) must look for operands in addition to the actual
registers, and generally somewhere else as well (as we will see when we look at algorithms). This again
means more work to do, and potentially worse performance.

An alternative is instead to provide a large set of registers and dynamically decide at any point in time
which ones represent the actual architectural registers. For example, in a machine with 16 architectural
registers we might provide a set of 64 physical registers. At any point in time, only 16 of these will
correspond with the actual architectural register set - precisely which 16 changing as the program
executes. This method - usually just called Register Renaming - solves the problems above. However, it
has problems of its own. One is deciding at any one time which registers are live - that is, contain results
that are still needed. Note that the live registers are not exactly the same set of registers as those
corresponding with the architectural ones - determining those is another problem. Most of the time it
does not matter which registers correspond with the architectural ones. However, when a context switch
occurs it is necessary to know which ones to save - so there must be a means of finding this out. This is
potentially a fair amount of work - however much of it can be done in parallel with the operation of the
rest of the pipeline, meaning it is not on the critical path and will not slow it down. The Pentium 4 has
chosen to change from a reorder buffer to register renaming.
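
A mapping table plus a free list is one common way to picture this scheme. The Python sketch below is illustrative only and is not taken from the chapter: the class `RenameTable`, its method names, and the policy of handing out the lowest-numbered free register are assumptions. It uses the 16 architectural and 64 physical registers of the example above, and it deliberately leaves the hard part, deciding when an old physical register is no longer live, to the caller.

```python
class RenameTable:
    def __init__(self, num_arch=16, num_phys=64):
        # Initially architectural register Ri lives in physical register i.
        self.mapping = {f"R{i}": i for i in range(num_arch)}
        self.free = list(range(num_arch, num_phys))

    def rename_dest(self, arch_reg):
        """Allocate a fresh physical register for a write; return (new, old)."""
        if not self.free:
            raise RuntimeError("no free physical register: issue must stall")
        old = self.mapping[arch_reg]
        new = self.free.pop(0)
        self.mapping[arch_reg] = new
        return new, old

    def release(self, phys_reg):
        """Return a physical register to the free list once it is no longer live."""
        self.free.append(phys_reg)

    def architectural_set(self):
        """The physical registers that would have to be saved on a context switch."""
        return sorted(self.mapping.values())


table = RenameTable()
new, old = table.rename_dest("R2")   # a write to R2 gets a new physical register
saved = table.architectural_set()    # still exactly 16 registers, but a different 16
table.release(old)                   # assumed: no pending reader still needs the old R2
```

Note that `old` stays allocated after the rename until `release` is called; that interval is exactly when a register is live without being part of the architectural set.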

Elaboration: Memory accesses benefit from nonblocking caches, which continue
servicing cache accesses during a cache miss (see Chapter 7). Out-of-order execution
processors need nonblocking caches to allow instructions to execute during a miss.



Real Stuff: PowerPC 604 and Pentium Pro Pipelines

Dynamically scheduled pipelines are used in both the PowerPC 604 and the
Pentium Pro. They have such similar pipeline organizations that we use a single
generic drawing, Figure 6.62, to describe both. Figure [...]27
shows the silicon area required by dynamic pipelining in the Pentium Pro.

The instruction cache fetches 16 bytes of instructions and sends them to an
instruction queue: four instructions for the PowerPC and a variable number of
instructions for the Pentium Pro. Next, several instructions are fetched and decoded.
Both processors use a 512-entry branch history table to predict branches
and speculatively execute instructions after the predicted branch. The dispatch
unit sends each instruction and its operands to the reservation station
of one of the six functional units. The dispatch unit also places an entry for the
instruction in the reorder buffer of the commit unit. Thus an instruction cannot
issue unless there is space available in both an appropriate reservation station
and in the reorder buffer.
With so many instructions executing at the same time, we can run out of
places to keep results. Both processors have extra internal registers, called
rename buffers or rename registers, that are used to hold results while waiting for
the commit unit to commit the result to one of the real registers. The decode
unit is where rename buffers get assigned, thereby reducing hazards on register
numbers. Whenever an entry in the reservation station has all its operands
and the associated functional unit is available, the operation is performed.

The commit unit keeps track of all the pending instructions in its reorder
buffer. Because both machines use branch prediction, an instruction is not finished
until the commit unit says it is. When the branch functional unit
determines whether or not a branch was taken, it informs the branch prediction
unit, so that it can update its state machine, and the commit unit, so that it can
decide the fate of pending instructions. If the prediction was accurate, the results
of the instructions after the branch are marked valid and thus can be
placed in the programmer-visible registers and memory. If a misprediction
occurred, then all the instructions after the branch are marked invalid and
discarded from the reservation stations and reorder buffer.

The commit unit can commit several instructions per clock cycle. To provide
precise behavior during exceptions, the commit unit makes sure the instructions
commit in the order they were issued. Thus the commit unit cannot commit
an instruction until the operation is finished in the functional unit, all
branches on which it might depend are resolved, and all instructions issued
before it have committed.
Figure 6.63 lists the specific parameters for the PowerPC 604 and Pentium
Pro pipelines. Many of the differences between the Pentium Pro and the
PowerPC 604 are cosmetic. The largest difference, not surprisingly, is in decode
and dispatch. As described in section 5.7, rather than try to pipeline
variable-length 80x86 instructions, the Pentium Pro decode unit translates the Intel
instructions into 72-bit, fixed-length microoperations, and then sends these
microoperations to the reorder buffer and reservation stations. This translation
takes 1 clock cycle to determine the length of the 80x86 instructions and then 2
more to create the microoperations.

As discussed in Chapter 5, these microoperations have two source registers 
and one destination register, and are similar to MIPS instructions. Most 80x86 
instructions are translated into one to four microoperations, but the really com- 
plex 80x86 instructions are executed by a conventional microprogram that 
issues long sequences of microoperations. 





Parameter                                                              PowerPC 604                  Pentium Pro

Maximum number of instructions issued per clock cycle                  4                            3
Maximum number of instructions completing execution per clock cycle    6                            5
Maximum number of instructions committed per clock cycle               6                            3
Number of bytes fetched from instruction cache                         16                           16
Number of bytes in instruction queue                                   32                           32
Number of instructions in reorder buffer                               16                           40
Number of entries in branch table buffer                               512                          512
Number of history bits per entry in branch history buffer              2                            4
Number of rename buffers                                               12 integer + 8 FP            40
Total number of reservation stations                                   12                           20
Total number of functional units                                       6                            [...]
Number of integer functional units                                     2                            2
Number of complex integer operation functional units                   1                            0
Number of floating-point functional units                              1                            1
Number of branch functional units                                      1                            1
Number of memory functional units                                      1 for both load and store    1 for load + 1 for store

FIGURE 6.63 Specific parameters of the PowerPC 604 and Pentium Pro in Figure 6.62.